Counting Quail, a second time

For all problem sets, we will provide you with a single R Markdown file. Please complete the problem set within a local copy of this file. To turn in, please upload a fully knitted html version. Make sure to keep echo=TRUE, as appropriate, to show your code.

In Problem Set 6 you were introduced to a stunning dataset of count data for the Angeleno Quail (Callipepla curtii) collected in 2015. As a reminder, surveyors visited 267 survey points three times during the breeding season, conducting targeted surveys for the quail.

Last time, you fit a mixed-effects model to the data and found strong effects of elevation, forest cover, their interaction, and wind. You noticed, however, that your dataset included a lot of zeroes, but you didn’t let that disturb you! After this week’s lessons, however, you’ve become a bit more worried, and you wish to refit the data in a way that accounts for imperfect detection. Since you have count data, you decide to use an N-mixture model. Like an occupancy model, but for abundance data, the N-mixture model separates the underlying abundance state process (here, the factors that determine quail abundance) from the observation process (here, how imperfect detection reduces the number of quail actually seen), and models the data as a mixture of these two processes.
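As a quick intuition check, the two linked processes can be simulated in a few lines of R (the values of `lambda` and `p` here are made up purely for illustration, not taken from the quail data):

```r
# Toy simulation of the N-mixture data-generating process
set.seed(1)
n_sites <- 5; n_visits <- 3
lambda <- 4                   # expected true abundance per site (made-up value)
p      <- 0.5                 # per-individual detection probability (made-up value)
N <- rpois(n_sites, lambda)   # state process: latent true abundance
y <- sapply(1:n_visits, function(k) rbinom(n_sites, size = N, prob = p))
# N is fixed across visits (closure); the observed counts y vary across
# visits only because detection is imperfect
```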

After talking to the biologists who surveyed the species, they tell you that they believe that forest cover and elevation likely influence quail abundance, while wind is a very strong factor in their ability to detect quail (but as a variable that changes by the day, wind doesn’t influence their true underlying abundance). The biologists are unsure whether canopy cover (which correlates with ground cover, as well as auditory transmission) impacts detection probability, so it’s probably safer to test for it. You use this natural history knowledge to guide you in your model parameterization.

The data are identical to those used in Problem Set 6, so re-load the R data object.

load("ProblemSet6.Rdata")

Your goal is to run a single N-mixture model in JAGS and learn what you can about the species from the fitted model. The model should include all the major components of this week’s problem set: the effect of elevation, forest cover, the interaction, and wind. Note, however, that you do not have to include a random effect for site with this N-mixture model, because N-mixture models, like occupancy models, assume that the true underlying state does not change across repeat visits. Thus the “pseudoreplication” of repeat visits is what informs the detection parameter, while abundance covariates are parameterized on the latent variable, \(N\), which is assumed to be unchanging across visits (an assumption called “closure”). You should realize, consequently, that the true abundance at a site (\(N_j\)) is estimated only once for each of \(j\) sites. Consequently, your model should follow the general structure of an N-mixture model, which is:

\(y_{jk}\sim Binomial(p_{jk}, N_j)\)

\(N_j \sim Poisson(\lambda_j)\)

where \(p_{jk}\) is a probability of detection at site \(j\) and visit \(k\) that can be modeled as a function of covariates, and \(\lambda_j\) is the expected true abundance at site \(j\), which can also be modeled as a function of covariates.

  1. Draw a DAG for the model you are about to create. Display your DAG below using the ggdag R package.

  2. Using your DAG, write out the JAGS model code for this model. Use appropriate vague priors for all parameters. You will need to create two nested for-loops, a first loop over every site j, and a second loop over each visit k. Thus, your process model should be for every count[j,k]. Keep in mind that your deterministic process and observation models will use different link functions, which means that a truly ‘vague’ prior for your parameters may differ depending on whether they are an abundance parameter or a detection parameter.
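One way to see why “vague” depends on the link function: a normal prior that is effectively flat on the log scale becomes strongly bimodal once pushed through the inverse-logit. A quick sketch, using the same dnorm precision as the priors in the model below, purely to illustrate the point:

```r
# A dnorm(0, 0.00001) prior has sd = sqrt(1/0.00001), roughly 316
beta_prior <- rnorm(1e5, mean = 0, sd = sqrt(1/0.00001))
p_prior <- plogis(beta_prior)             # inverse-logit transform
mean(p_prior < 0.01 | p_prior > 0.99)     # nearly all prior mass piles up at 0 and 1
```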

qs_mod<-function(){
  # Vague priors for abundance (log-link) parameters
  b0 ~ dnorm(0,0.00001)
  b_elev ~ dnorm(0,0.00001)
  b_forest ~ dnorm(0,0.00001)
  b_int ~ dnorm(0,0.00001)
  # Vague priors for detection (logit-link) parameters
  alpha_0 ~ dnorm(0,0.00001)
  alpha_wind ~ dnorm(0,0.00001)
  alpha_forest ~ dnorm(0,0.00001)
  # Optional site-level random effect on abundance
  tau_site ~ dgamma(0.001,0.001)
  sigma_site <- 1 / sqrt(tau_site)
  for (j in 1:site_num){
    # State process: latent true abundance at site j
    N[j] ~ dpois(lam[j])
    log(lam[j]) <- b0+b_elev*elev[j]+b_forest*forest[j]+b_int*elev[j]*forest[j]+b_site[j]
    b_site[j] ~ dnorm(0,tau_site)
    for (k in 1:visit_num){
      # Observation process: detection probability varies by site and visit
      logit(p[j,k]) <- alpha_0+alpha_wind*wind[j,k]+alpha_forest*forest[j]
      count[j,k] ~ dbinom(p[j,k],N[j])
      # Replicated data for posterior predictive checks
      count.new[j,k] ~ dbinom(p[j,k],N[j])
    }
  }
}
  1. Given that we ran an analysis last week and estimated parameters for this species, would it be advantageous to use the findings from last week’s analysis as informed priors for the current analysis? Why or why not?
# I do not think it would be advantageous to use last week's findings as informed priors for the current analysis. Last week's analysis implicitly assumed perfect detection of individuals, whereas this week's model explicitly separates imperfect detection from true abundance. If we used last week's estimates as priors, they could pull the posterior toward the wrong region of parameter space: in particular, the abundance-related parameters (such as b0) are likely to be systematically larger this week than last week, because the N-mixture model corrects the counts upward for birds that were present but not detected.

#Furthermore, we have also removed wind from the abundance covariates (it now enters only the detection model), so the model structure has changed and last week's priors would not map cleanly onto this week's parameters.
  1. Prepare your list of data to send to JAGS and run your JAGS model. Check to make sure it has converged. If convergence looks questionable, change your MCMC settings so that the model converges. As discussed in class and shown in Lab 7, if you receive a Node inconsistent with parents error from JAGS, this indicates that you need to give JAGS plausible initial values for the latent variable, N.
set.seed(2025)
count=ps.data[["count"]]
elevation=ps.data[["elev"]]
forest=ps.data[["forest"]]
wind=ps.data[["wind"]]
avg_count=apply(count,1,mean)
site_num=nrow(wind)
visit_num=ncol(wind) 
site=rep(1:site_num,each=visit_num)
visit=rep(1:visit_num,times=site_num)
total_sample=site_num*visit_num

jags_data_qs<-list(
  site_num=site_num,
  visit_num=visit_num,
  count=count,
  wind=wind,
  elev=elevation,
  forest=forest
)

qs.fit<-jags(data=jags_data_qs,parameters.to.save=c("b0", "b_elev","b_forest","b_int","tau_site","sigma_site","alpha_0","alpha_wind","alpha_forest"),model.file=qs_mod,
             inits=list(list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1)),
              n.chains=3,
              n.iter=20000,
              n.burnin=5000,
              n.thin=10)
## module glm loaded
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 801
##    Unobserved stochastic nodes: 1343
##    Total graph size: 7492
## 
## Initializing model
qs.check<-jags(data=jags_data_qs,parameters.to.save=c("count.new","p"),model.file=qs_mod,
              inits=list(list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1),
                        list(N=apply(count,1,max)+1)),
                    n.chains=3,
                    n.iter=20000,
                    n.burnin=5000,
                    n.thin=10)
## Compiling model graph
##    Resolving undeclared variables
##    Allocating nodes
## Graph information:
##    Observed stochastic nodes: 801
##    Unobserved stochastic nodes: 1343
##    Total graph size: 7492
## 
## Initializing model
MCMCsummary(qs.fit) 
##                       mean           sd          2.5%           50%
## alpha_0        -0.89359679   0.11327924   -1.12064343 -8.930820e-01
## alpha_forest    0.01305130   0.16897255   -0.29803828  9.500327e-03
## alpha_wind     -2.98527991   0.13919744   -3.24938669 -2.989302e+00
## b0              0.62427944   0.08908459    0.44919141  6.251070e-01
## b_elev         -2.05852386   0.11907291   -2.29740959 -2.058708e+00
## b_forest        2.14059057   0.12068901    1.90594868  2.143818e+00
## b_int           1.29513484   0.16953891    0.96008017  1.295504e+00
## deviance     1323.42152541  28.44176862 1269.67480612  1.322952e+03
## sigma_site      0.07152209   0.03542327    0.02278854  6.483815e-02
## tau_site      425.04040092 522.73293794   41.02532164  2.378695e+02
##                     97.5% Rhat n.eff
## alpha_0        -0.6728587 1.01   180
## alpha_forest    0.3376666 1.00   226
## alpha_wind     -2.7015073 1.01   238
## b0              0.7980839 1.01   587
## b_elev         -1.8250669 1.00   960
## b_forest        2.3740167 1.00   665
## b_int           1.6244017 1.00  1043
## deviance     1381.8100583 1.02   216
## sigma_site      0.1561256 1.00  1422
## tau_site     1925.6471315 1.00  2805
# Most Rhat values are 1 or very close to 1, so the monitored parameters have converged.
#However, some count.new nodes (monitored in qs.check) have Rhat = NaN, which is worth investigating.
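If you want to isolate the problematic nodes, one option (assuming the MCMCvis package is loaded, as above) is to pull the summary for `count.new` and flag entries whose Rhat is missing or above a threshold:

```r
cn_sum <- MCMCsummary(qs.check, params = "count.new")
bad_nodes <- rownames(cn_sum)[is.na(cn_sum$Rhat) | cn_sum$Rhat > 1.1]
head(bad_nodes)
# NaN Rhats for count.new cells typically arise when a simulated node barely
# varies across iterations (e.g., p near 0), rather than from a convergence
# failure in the structural parameters themselves
```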
  1. Make traceplots for your slope and intercept parameters. Check Prior-Posterior Overlap. Note that you may wish to do that in two different commands, if you used multiple priors in your model.
traceplot(qs.fit)

#As last week, we only calculate the PPO for the fixed effects, excluding the random effect.
pr_norm=rnorm(15000,mean=0,sd=sqrt(1/0.00001))
MCMCtrace(qs.fit,params="b0",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b0", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="b_elev",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_elev", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="b_int",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_int", priors = pr_norm, pdf = FALSE):
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="b_forest",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "b_forest", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="alpha_0",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_0", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="alpha_wind",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_wind", priors = pr_norm, pdf =
## FALSE): Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

MCMCtrace(qs.fit,params="alpha_forest",priors=pr_norm,pdf=FALSE)
## Warning in MCMCtrace(qs.fit, params = "alpha_forest", priors = pr_norm, :
## Number of samples in prior is greater than number of total or specified
## iterations (for all chains) for specified parameter. Only last 4500 iterations
## will be used.

  1. Your traceplots should look quite different for the parameters for detection versus the parameters for abundance (e.g., compare traceplots for the effect of forest on abundance vs detection). Why should MCMC be more or less efficient for abundance versus detection processes?
#MCMC is much more efficient for the abundance parameters than for the detection parameters: the abundance traceplots look like hairy caterpillars, while the detection traceplots are more diffuse and slow-moving. A main reason is that the detection parameters are strongly correlated with the latent abundances N (the counts directly inform only the product of abundance and detection). Because each detection-parameter update depends on the current values of N, the detection chains inherit extra autocorrelation on top of the autocorrelation the sampler already produces, so MCMC explores the detection process less efficiently.
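The effective sample sizes make this difference concrete; sorting the `n.eff` column reported by `MCMCsummary` ranks the parameters by mixing efficiency:

```r
fit_sum <- MCMCsummary(qs.fit)
fit_sum[order(fit_sum$n.eff), "n.eff", drop = FALSE]
# the alpha_* (detection) parameters sit near the bottom of this ranking,
# consistent with their more diffuse traceplots
```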
  1. Using the posterior summary, what can we learn about our hypothesized variables of interest?
MCMCsummary(qs.fit)
##                       mean           sd          2.5%           50%
## alpha_0        -0.89359679   0.11327924   -1.12064343 -8.930820e-01
## alpha_forest    0.01305130   0.16897255   -0.29803828  9.500327e-03
## alpha_wind     -2.98527991   0.13919744   -3.24938669 -2.989302e+00
## b0              0.62427944   0.08908459    0.44919141  6.251070e-01
## b_elev         -2.05852386   0.11907291   -2.29740959 -2.058708e+00
## b_forest        2.14059057   0.12068901    1.90594868  2.143818e+00
## b_int           1.29513484   0.16953891    0.96008017  1.295504e+00
## deviance     1323.42152541  28.44176862 1269.67480612  1.322952e+03
## sigma_site      0.07152209   0.03542327    0.02278854  6.483815e-02
## tau_site      425.04040092 522.73293794   41.02532164  2.378695e+02
##                     97.5% Rhat n.eff
## alpha_0        -0.6728587 1.01   180
## alpha_forest    0.3376666 1.00   226
## alpha_wind     -2.7015073 1.01   238
## b0              0.7980839 1.01   587
## b_elev         -1.8250669 1.00   960
## b_forest        2.3740167 1.00   665
## b_int           1.6244017 1.00  1043
## deviance     1381.8100583 1.02   216
## sigma_site      0.1561256 1.00  1422
## tau_site     1925.6471315 1.00  2805
#The mean of b0 is 0.624, meaning that the expected true abundance at an average site, before considering the other effects, is e^0.624.

#Elevation has a strong, highly certain negative effect on quail abundance. The mean of b_elev is -2.059, so when (scaled) elevation increases by 1, the expected abundance at a site changes by a factor of e^-2.059.

#Both forest cover and the forest-by-elevation interaction have a positive effect on quail abundance, and we are confident in these effects. When forest cover increases by 1 unit, expected abundance increases by a factor of e^2.14 (at mean elevation), and the positive interaction (e^1.29 per unit of the product term) means the forest effect strengthens as elevation increases.

#Sigma_site is positive but relatively low (0.0715), suggesting that most among-site variation is explained by the included covariates and that sites harbor little residual, unmodeled variation in abundance.

#alpha_0 is strongly negative with high certainty. Its mean of -0.894 means that the baseline odds (not probability) of detection, before considering other effects, are e^-0.894.

#alpha_wind has a very strong, highly certain negative effect on detection. Its mean of -2.99 implies that the odds of detection change by a factor of e^-2.99 for each 1-unit increase in wind, so we are highly confident that wind reduces our ability to detect quail.

#The 95% credible interval for alpha_forest overlaps 0, so we are uncertain whether canopy cover affects detection at all. Moreover, its mean is small (0.0131). It is therefore safe to say that canopy cover contributes little to the detection process.
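To make these interpretations concrete, the posterior means can be back-transformed to the natural scale (the numbers below are the posterior means from the summary above):

```r
exp(0.624)      # baseline expected abundance: ~1.87 quail per site
exp(-2.059)     # multiplicative effect of +1 unit elevation on abundance: ~0.13
plogis(-0.894)  # baseline per-visit detection probability: ~0.29
```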
  1. Let’s double check that this model is an adequate fit to the data. Last week we conducted two posterior predictive checks (PPC), using a test statistic of mean(y) and sd(y). Remember that the model fit well for the mean but not the sd. Repeat this now, with our N-mixture model, by calculating the observed and posterior distributions for both test statistics, and plot each (make sure your plots show both observed lines and the full posterior). Calculate the Bayesian p-value for each as well.
ppd=MCMCchains(qs.check,params="count.new")
ppd_mean=apply(ppd, 1, mean)
ppd_sd=apply(ppd, 1, sd)
obs_mean=mean(count)
obs_sd=sd(count)

hist(ppd_mean,breaks=20,main="Posterior Prediction Check for Mean",xlab="Simulated Means",ylab="Frequency",xlim=c(min(ppd_mean,obs_mean)-1,max(ppd_mean,obs_mean)+1))
abline(v=obs_mean,col="red",lwd=2,lty=2)
legend("topright",legend ="Observed mean",col="red",lty=2,lwd=2)

hist(ppd_sd,breaks=20,main="Posterior Prediction Check for SD",xlab="Simulated Sd",ylab="Frequency",xlim=c(min(ppd_sd,obs_sd)-1,max(ppd_sd,obs_sd)+1))
abline(v=obs_sd,col="red",lwd=2,lty=2)
legend("topright",legend ="Observed Standard Deviation",col="red",lty=2,lwd=2)

p_mean=mean(ppd_mean>obs_mean)
p_sd=mean(ppd_sd>obs_sd)

p_mean
## [1] 0.4791111
p_sd
## [1] 0.6746667
#The Bayesian p-values are 0.479 (mean) and 0.675 (sd).
#Both are far from 0 and 1, so the N-mixture model now adequately reproduces both the mean and the standard deviation of the observed data, an improvement over last week's model, which fit the mean but not the sd.
  1. Using your posterior, create three plots (a, b, c), each showing how true quail abundance is predicted to change with forest cover. Plot (a) should show “low elevation” (elev = -1), (b) should show middle elevation (elev = 0), and (c) should show high elevation (elev = 1). Your y-axis should be “Predicted abundance” and your x-axis should be “Forest cover (scaled)”. Label each plot. Each plot should clearly show a median prediction line, as well as an 89% credible interval in prediction around that median (to accomplish this, you will have to derive predictions from the full posterior sample of your parameters).
b0=MCMCchains(qs.fit,params="b0")
b_elev=MCMCchains(qs.fit,params="b_elev")
b_forest=MCMCchains(qs.fit,params="b_forest")
b_int=MCMCchains(qs.fit,params="b_int")

forest_cov_pred=seq(min(forest),max(forest),by=0.01)
elev_discrete=c(-1,0,1)

plot_list <- list()
for (i in elev_discrete){
  lower_ppd=c()
  upper_ppd=c()
  elev_val_used=c()
  prediction=c()
  for (j in 1:length(forest_cov_pred)){
    forest_cov_val=forest_cov_pred[j]
    # name this `abund`, not `count`, so we don't overwrite the count data matrix
    abund=exp(b0+b_elev*i+b_forest*forest_cov_val+b_int*i*forest_cov_val)
    lower_ppd[j]=quantile(abund,0.055)
    upper_ppd[j]=quantile(abund,0.945)
    elev_val_used[j]=i
    prediction[j]=median(abund)
  }
  plot_list[[as.character(i)]] <- data.frame(
    forest_cover=forest_cov_pred,
    median_pred=prediction,
    lower=lower_ppd,
    upper=upper_ppd,
    elevation=elev_val_used)
}
ggplot(plot_list[["-1"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=-1)",x="Forest Cover",y="Predicted Abundance",col="Key")

ggplot(plot_list[["0"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=0)",x="Forest Cover",y="Predicted Abundance",col="Key")

ggplot(plot_list[["1"]],aes(x=forest_cover))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Quail Abundance with 89% Credible Interval (Elevation=1)",x="Forest Cover",y="Predicted Abundance",col="Key")

  1. Compare your plots for #9 above to your similar plots from Problem Set 6, where you modeled these data using a GLMM. How are the plots similar? How are they different? If there are major differences, explain why the N-mixture model produces them.
#The plots look similar in shape. This is reasonable: the underlying biological process, how elevation and forest cover drive abundance, is modeled the same way in both analyses, so the fitted relationships should resemble each other.

#One major difference, however, is that the plots above predict considerably higher abundance per site than the PS6 plots. This is because we now account for the detection process: the true number of quail at a site must be at least as large as the number actually observed, whereas PS6 implicitly assumed that every quail present was counted. Correcting for imperfect detection therefore inflates the predicted abundances relative to PS6.
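A rough sanity check on this explanation: at baseline covariate values, the expected observed count should be roughly the product of expected abundance and detection probability. Using the posterior means from the summary above:

```r
lambda_hat <- exp(0.624)     # expected true abundance at baseline, ~1.87
p_hat <- plogis(-0.894)      # per-visit detection probability, ~0.29
lambda_hat * p_hat           # ~0.54 birds expected to be counted per visit,
                             # well below lambda_hat, as the comparison suggests
```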
  1. Unlike our GLMM model, our N-mixture model is modeling two key parameters: true abundance (\(N\)) and probability of detection (\(p\)). Similar to question #9, create a predicted response plot showing the median response and 89% credible ribbon, but this time, plot the probability of detection as a function of wind.
a0=MCMCchains(qs.fit,params="alpha_0")
a_forest=MCMCchains(qs.fit,params="alpha_forest")
a_wind=MCMCchains(qs.fit,params="alpha_wind")
p_val_ppd=MCMCchains(qs.check,params="p")

wind_cov_pred=seq(min(wind),max(wind),by=0.01)
forest_discrete=c(-1,0,1)

plot_list <- list()
for (i in forest_discrete){
  lower_ppd=c()
  upper_ppd=c()
  forest_val_used=c()
  prediction=c()
  for (j in 1:length(wind_cov_pred)){
    wind_val=wind_cov_pred[j]
    # plogis() is the built-in inverse-logit, equivalent to exp(x)/(1+exp(x))
    p_val=plogis(a0+a_forest*i+a_wind*wind_val)
    lower_ppd[j]=quantile(p_val,0.055)
    upper_ppd[j]=quantile(p_val,0.945)
    forest_val_used[j]=i
    prediction[j]=median(p_val)
  }
  plot_list[[as.character(i)]] <- data.frame(
    wind_val_plot=wind_cov_pred,
    median_pred=prediction,
    lower=lower_ppd,
    upper=upper_ppd,
    forest_val_plot=forest_val_used)
}
ggplot(plot_list[["-1"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=-1)",x="Wind",y="Probability of detection",col="Key")

ggplot(plot_list[["0"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=0)",x="Wind",y="Probability of detection",col="Key")

ggplot(plot_list[["1"]],aes(x=wind_val_plot))+
  geom_ribbon(aes(ymin = lower, ymax = upper,col="89% CI"), fill="red",alpha=0.3)+
  geom_line(aes(y = median_pred, col="Median"), linewidth = 1)+
  labs(title = "Predicted Detection Probability with 89% Credible Interval (Forest Cover=1)",x="Wind",y="Probability of detection",col="Key")